prefill stage
MOM: Memory-Efficient Offloaded Mini-Sequence Inference for Long Context Language Models
Zhang, Junyang, Zhu, Tianyi, Luo, Cheng, Anandkumar, Anima
Long-context language models exhibit impressive performance but remain challenging to deploy due to high GPU memory demands during inference. We propose Memory-efficient Offloaded Mini-sequence Inference (MOM), a method that partitions critical layers into smaller "mini-sequences" and integrates seamlessly with KV cache offloading. Experiments on various Llama, Qwen, and Mistral models demonstrate that MOM reduces peak memory usage by over 50\% on average. On Meta-Llama-3.2-8B, MOM extends the maximum context length from 155k to 455k tokens on a single A100 80GB GPU, while keeping outputs identical and not compromising accuracy. MOM also maintains highly competitive throughput due to minimal computational overhead and efficient last-layer processing. Compared to traditional chunked prefill methods, MOM achieves a 35\% greater context length extension. More importantly, our method drastically reduces prefill memory consumption, eliminating it as the longstanding dominant memory bottleneck during inference. This breakthrough fundamentally changes research priorities, redirecting future efforts from prefill-stage optimizations to improving decode-stage residual KV cache efficiency.
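To make the mini-sequence idea concrete, here is a minimal sketch, assuming a toy single-layer "model" and illustrative names (toy_layer, minisequence_prefill, mini_seq_len); it is not the authors' implementation. Prefill runs over one mini-sequence of the prompt at a time, and each chunk's resulting states are offloaded to host memory, so peak GPU memory scales with the mini-sequence length rather than the full prompt length.

```python
# Minimal sketch (not the MOM implementation) of mini-sequence prefill with
# KV-cache offloading, using plain PyTorch and a toy single-layer "model".
import torch

def toy_layer(x, w):
    # Stand-in for a memory-hungry layer (e.g. MLP / lm_head): peak activation
    # memory scales with the number of tokens processed at once.
    return torch.tanh(x @ w)

def minisequence_prefill(hidden, w, mini_seq_len=1024, offload_device="cpu"):
    """Run prefill over the prompt in mini-sequences, keeping only one
    mini-sequence of activations on the GPU and offloading the per-chunk
    states (standing in for key/value cache entries) to host memory."""
    kv_chunks = []
    for start in range(0, hidden.shape[1], mini_seq_len):
        chunk = hidden[:, start:start + mini_seq_len]            # (B, m, d) on GPU
        out = toy_layer(chunk, w)                                # peak mem ~ O(m)
        kv_chunks.append(out.to(offload_device, non_blocking=True))
    # The offloaded cache is streamed back as needed during decode.
    return torch.cat(kv_chunks, dim=1)

if __name__ == "__main__":
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1, 8192, 256, device=dev)   # long "prompt" of hidden states
    w = torch.randn(256, 256, device=dev)
    cache = minisequence_prefill(x, w, mini_seq_len=1024)
    print(cache.shape, cache.device)            # torch.Size([1, 8192, 256]) cpu
```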
Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization
Pang, Bowen, Li, Kai, She, Ruifeng, Wang, Feifan
With the development of large language models (LLMs), it has become increasingly important to optimize hardware usage and improve throughput. In this paper, we study the inference optimization of the serving system that deploys LLMs. To optimize system throughput and maximize hardware utilization, we formulate the inference optimization problem as a mixed-integer programming (MIP) model and propose a hybrid offline-online method as the solution. The offline method improves large-scale inference systems by introducing a Minimizing Makespan Bin Packing Problem. We further provide a theoretical lower-bound computation method. Then, we propose an online sorting and preemptive scheduling method to better utilize hardware. In the online iteration scheduling process, a Lagrangian method is applied to evaluate the cost efficiency of inserting prefill stages versus decode stages at each iteration and dynamically determine when to preempt decoding tasks and insert prefill tasks. Experiments using real-world data from the LLaMA-65B model and the GSM8K dataset demonstrate that system utilization improves from 80.2% to 89.1%, and the total inference time decreases from 201.00 to 190.58 seconds. A 100-case study shows that our method consistently outperforms the baseline method and improves the utilization rate by 8.0% on average. Finally, we discuss potential future extensions, including stochastic modeling, reinforcement learning-based schedulers, and dynamic decision-making strategies for system throughput and hardware utilization. Note to Practitioners: This work provides optimization tools for enhancing the efficiency of LLM inference systems through advanced scheduling techniques. From the perspective of LLM inference service providers, improved hardware utilization can reduce operational costs by requiring less hardware to maintain the same level of service. From the user's perspective, reduced inference time translates to faster response times and improved service quality. Furthermore, the proposed scheduling techniques are adaptable to various LLM models, hardware platforms, and datasets, making them highly scalable and broadly applicable to real-world LLM inference scenarios. Recent advancements in large language models (LLMs), including GPT-4, LLaMA, and Qwen, have significantly transformed the landscape of natural language processing by enabling more sophisticated text generation, comprehension, and interaction capabilities. These models serve as foundational technologies in a wide range of applications, such as chatbots, machine translation, and content creation. The authors are with Noah's Ark Lab, Huawei.
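As a rough illustration of the online part only, the following sketch compares a Lagrangian-penalized cost of inserting a waiting request's prefill against the cost of one more decode iteration, and preempts decoding when prefill wins. The cost functions, the fixed multiplier lam, and the Request bookkeeping are made-up assumptions, not the paper's MIP formulation.

```python
# Minimal sketch of a Lagrangian-style per-iteration choice between inserting
# a prefill step and continuing decode; all costs and constants are illustrative.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    generated: int = 0
    target: int = 32

def step_cost_prefill(req, lam, queue_wait):
    # Cost of running this request's prefill now: a compute-time proxy minus a
    # Lagrangian reward for reducing its queueing delay.
    return req.prompt_tokens / 1000.0 - lam * queue_wait

def step_cost_decode(batch, lam, slo_slack):
    # Cost of one decode iteration for the running batch, with a Lagrangian
    # penalty once the per-token latency slack turns negative.
    return len(batch) / 1000.0 + lam * max(0.0, -slo_slack)

def schedule(waiting, running, lam=0.1, iters=60):
    for t in range(iters):
        if waiting:
            cand = waiting[0]
            if step_cost_prefill(cand, lam, queue_wait=t) < step_cost_decode(running, lam, slo_slack=1.0):
                print(f"iter {t}: preempt decode, insert prefill ({cand.prompt_tokens} tokens)")
                running.append(waiting.pop(0))
                continue
        for r in running:
            r.generated += 1                    # one decode iteration
        running = [r for r in running if r.generated < r.target]

schedule([Request(4096), Request(2048)], [Request(512, generated=10)])
```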
FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
Jo, Dongwon, Song, Jiwon, Kim, Yulhwa, Kim, Jae-Joon
While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00$\times$ and 1.40$\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.
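The TSP idea can be sketched as follows: early layers process the full prompt, and from a chosen layer onward only the top-scoring tokens are propagated. The stand-in transformer block, the attention-of-the-last-token scoring, and the keep_ratio are illustrative assumptions rather than FastKV's actual design details.

```python
# Minimal sketch (not FastKV's code) of Token-Selective Propagation during prefill.
import torch

def tsp_prefill(hidden, num_layers=8, tsp_layer=4, keep_ratio=0.25):
    """hidden: (batch, seq, dim). Keeps the full context until tsp_layer, then
    drops low-scoring tokens so deeper layers see a shorter sequence."""
    b, s, d = hidden.shape
    for layer in range(num_layers):
        hidden = torch.tanh(hidden)            # stand-in for a transformer block
        if layer == tsp_layer:
            # Score each token by its similarity to the last (query) token.
            q_last = hidden[:, -1:, :]                             # (b, 1, d)
            scores = (q_last @ hidden.transpose(1, 2)).squeeze(1)  # (b, s)
            k = max(1, int(keep_ratio * hidden.shape[1]))
            idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
            hidden = torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    return hidden

out = tsp_prefill(torch.randn(1, 4096, 64))
print(out.shape)   # (1, 1024, 64): deeper layers processed only 25% of the tokens
```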
Efficiently serving large multimedia models using EPD Disaggregation
Singh, Gursimran, Wang, Xinglu, Hu, Ivan, Yu, Timothy, Xing, Linzi, Jiang, Wei, Wang, Zhefeng, Bai, Xiaolong, Li, Yi, Xiong, Ying, Zhang, Yong, Fan, Zhenan
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step converts raw inputs into tokenized representations that inflate the token sequence for the prefill phase, negatively impacting key Service Level Objectives (SLOs) such as time to first token (TTFT) and end-to-end throughput. We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our disaggregation approach alleviates memory bottlenecks, mitigates synchronization delays, and supports flexible batching. Specifically, we employ a new caching mechanism for multimodal tokens that enables their asynchronous transfer, and we introduce an integrated module that finds the optimal configuration for the EPD system, minimizing resource usage while maximizing SLO-based performance metrics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15$\times$ less memory on encoding-stage GPUs), which supports up to 22$\times$ larger batch sizes, 10$\times$ more images per request, and a 2.2$\times$ larger KV cache. Further, it leads to significant improvements in end-to-end throughput (up to 57\% better) and latency metrics (TTFT up to 71\% lower) compared to systems that do not disaggregate. Our findings underscore the potential of EPD disaggregation to enable resource-efficient and high-performance multimodal inference at scale.
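A toy sketch of the disaggregation pattern follows, using Python threads and queues as stand-ins for dedicated encode, prefill, and decode workers with a hand-off cache between stages. All names and the string-based "caches" are placeholders, not the EPD system itself.

```python
# Minimal sketch: encode -> prefill -> decode stages connected by queues, so each
# stage can run on dedicated resources with its own batching policy.
import queue
import threading

encode_q, prefill_q, decode_q = queue.Queue(), queue.Queue(), queue.Queue()

def encoder():
    while (req := encode_q.get()) is not None:
        req["mm_tokens"] = f"enc({req['image']})"       # tokenize image/audio/video
        prefill_q.put(req)                              # hand off via a token cache
    prefill_q.put(None)

def prefiller():
    while (req := prefill_q.get()) is not None:
        req["kv_cache"] = f"kv({req['mm_tokens']}+{req['prompt']})"
        decode_q.put(req)
    decode_q.put(None)

def decoder():
    while (req := decode_q.get()) is not None:
        print(req["id"], "->", f"generate_from({req['kv_cache']})")

workers = [threading.Thread(target=f) for f in (encoder, prefiller, decoder)]
for w in workers:
    w.start()
for i in range(3):
    encode_q.put({"id": i, "image": f"img{i}.png", "prompt": "describe"})
encode_q.put(None)
for w in workers:
    w.join()
```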
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Huang, Wenxuan, Zhai, Zijie, Shen, Yunhang, Cao, Shaosheng, Zhao, Fei, Xu, Xiangfeng, Ye, Zheyu, Lin, Shaohui
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision understanding, reasoning, and interaction. However, inference computation and memory grow progressively as output tokens are generated during decoding, directly affecting the efficiency of MLLMs. Existing methods attempt to reduce vision-context redundancy to achieve efficient MLLMs. Unfortunately, the efficiency benefits of vision-context reduction in the prefill stage gradually diminish during the decoding stage. To address this problem, we propose Dynamic-LLaVA, a dynamic vision-language context sparsification framework that reduces the redundancy of the vision context in the prefill stage and decreases the memory and computation overhead of the generated language context during decoding. Dynamic-LLaVA designs a tailored sparsification inference scheme for different inference modes, i.e., prefill and decoding with or without KV cache, to achieve efficient inference of MLLMs. In practice, Dynamic-LLaVA can reduce computation consumption by $\sim$75\% in the prefill stage. Meanwhile, throughout the entire generation process of MLLMs, Dynamic-LLaVA reduces computation consumption by $\sim$50\% when decoding without KV cache, and saves $\sim$50\% GPU memory overhead when decoding with KV cache, owing to the vision-language context sparsification. Extensive experiments also demonstrate that Dynamic-LLaVA achieves efficient inference for MLLMs with negligible degradation in understanding and generation ability, or even performance gains, compared to full-context inference baselines. Code is available at https://github.com/Osilly/dynamic_llava.
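The core mechanism can be sketched as a learned token scorer that keeps only the top fraction of vision tokens during prefill; the linear scorer, keep_ratio, and token counts below are illustrative assumptions, not Dynamic-LLaVA's actual predictor modules.

```python
# Minimal sketch of dynamic vision-context sparsification: a small predictor
# scores vision tokens and only the highest-scoring ones reach later layers.
import torch
import torch.nn as nn

class TokenSparsifier(nn.Module):
    def __init__(self, dim, keep_ratio=0.25):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)     # lightweight learned importance score
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens):       # (batch, n_vis, dim)
        scores = self.scorer(vision_tokens).squeeze(-1)          # (batch, n_vis)
        k = max(1, int(self.keep_ratio * vision_tokens.shape[1]))
        idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
        idx = idx.unsqueeze(-1).expand(-1, -1, vision_tokens.shape[-1])
        return torch.gather(vision_tokens, 1, idx)               # (batch, k, dim)

vis = torch.randn(1, 576, 1024)             # e.g. 576 vision patch tokens
kept = TokenSparsifier(1024)(vis)
print(kept.shape)                           # (1, 144, 1024): ~75% fewer vision tokens
```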
ProMoE: Fast MoE-based LLM Serving using Proactive Caching
Song, Xiaoniu, Zhong, Zihang, Chen, Rong
The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.
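A minimal sketch of proactive expert prefetching follows, assuming a toy predictor, a dict-based GPU cache with naive eviction, and random expert weights; all of these are illustrative stand-ins, not ProMoE's components.

```python
# Minimal sketch: predict which experts the next MoE layer will route to and
# fetch their weights from host memory before they are needed.
import torch

NUM_EXPERTS, DIM, CACHE_SLOTS = 8, 64, 3
host_experts = [torch.randn(DIM, DIM) for _ in range(NUM_EXPERTS)]   # offloaded weights
gpu_cache = {}                                                       # expert_id -> weight

def predict_experts(hidden, top_k=2):
    # Stand-in predictor: use a cheap projection of current hidden states as a
    # proxy for the next layer's routing decision.
    logits = hidden.mean(dim=0) @ torch.randn(DIM, NUM_EXPERTS)
    return logits.topk(top_k).indices.tolist()

def prefetch(expert_ids, device):
    for e in expert_ids:
        if e not in gpu_cache:
            if len(gpu_cache) >= CACHE_SLOTS:                # simple FIFO-style eviction
                gpu_cache.pop(next(iter(gpu_cache)))
            gpu_cache[e] = host_experts[e].to(device, non_blocking=True)

def moe_layer(hidden, expert_ids, device):
    prefetch(expert_ids, device)          # in a real system this overlaps with compute
    return sum(hidden @ gpu_cache[e] for e in expert_ids) / len(expert_ids)

dev = "cuda" if torch.cuda.is_available() else "cpu"
h = torch.randn(16, DIM, device=dev)
out = moe_layer(h, predict_experts(h.cpu()), dev)
print(out.shape, sorted(gpu_cache))
```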
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Qin, Ruoyu, Li, Zheming, He, Weiran, Zhang, Mingxing, Wu, Yongwei, Zheng, Weimin, Xu, Xinran
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
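The early-rejection idea can be illustrated with a small admission-control sketch: a request is rejected at admission time if its predicted load would push latency past the SLO. The throughput estimate, SLO value, and ClusterState fields are assumptions for illustration, not Mooncake's actual predictor or scheduler.

```python
# Minimal sketch of a prediction-based early-rejection policy.
from dataclasses import dataclass

@dataclass
class ClusterState:
    prefill_tokens_per_s: float = 50_000.0
    queued_prefill_tokens: int = 0

def predicted_ttft(state, prompt_tokens):
    # Predicted time-to-first-token = queued work ahead + this request's prefill.
    return (state.queued_prefill_tokens + prompt_tokens) / state.prefill_tokens_per_s

def admit(state, prompt_tokens, ttft_slo_s=2.0):
    if predicted_ttft(state, prompt_tokens) > ttft_slo_s:
        return False                                   # early rejection
    state.queued_prefill_tokens += prompt_tokens
    return True

state = ClusterState(queued_prefill_tokens=60_000)
print(admit(state, 32_000), admit(state, 128_000))     # True False (under these defaults)
```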